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Abstract 

We present a novel approach for fully non-stationary Gaussian process regression (GPR), where all three key 
parameters - noise variance, signal variance and lengthscale - can be simultaneously input-dependent. We de¬ 
velop gradient-based inference methods to learn the unknown function and the non-stationary model parameters, 
without requiring any model approximations. We propose to infer full parameter posterior with Hamiltonian 
Monte Carlo (HMC), which conveniently extends the analytical gradient-based GPR learning by guiding the 
sampling with model gradients. We also learn the MAP solution from the posterior by gradient ascent. In ex¬ 
periments on several synthetic datasets and in modelling of temporal gene expression, the nonstationary GPR 
is shown to be necessary for modeling realistic input-dependent dynamics, while it performs comparably to 
conventional stationary or previous non-stationary GPR models otherwise. 


1 Introduction 

Gaussian process regression has emerged as a powerful, yet practical class of non-parametric Bayesian models 
that quantify the uncertainties of the underlying process using Gaussian distributions (Rasmussen and Williams, 
2006). Gaussian processes are commonly applied to time-series interpolation, regression and classification, where 
the GP can provide predictive distributions (Rasmussen and Williams, 2006). 

The standard GP model assumes that the model parameters stay constant over the input space. This includes 
the observational noise variance as well as the signal variance and the lengthscale i of the covariance 
function. The signal variance determines the signal amplitude, while the characteristic lengthscale defines the 
local ‘support’ neighborhood of the function. In many real world problems either the noise variance or the signal 
smoothness, or both, vary over the input space, implying a heteroscedatic noise model or a nonstationary function 
dynamics, respectively (Le et ah, 2005) (see also Wang and Neal (2012)). In both cases, the analytical posterior 
of the GP becomes intractable (Tolvanen et ah, 2014). For instance, in biological studies, rapid signal changes are 
often observed quickly after perturbations, with the signal becoming smoother in time (Heinonen et ah, 2015). 

Several authors have proposed extending GPs by learning a latent noise variance as another GP, and by infer¬ 
ring the unknown function and the noise model in a maximum likelihood (ML) (Kersting et ah, 2007) or max¬ 
imum a posteriori (MAP) fashion (Quadrianto et ah, 2009). Fully Bayesian inference methods include MCMC 
sampling (Goldberg, 1998) and variational and expectation propagation approximations of the posterior (Lazaro- 
Gredilla and Titsias, 2011; Tolvanen et ah, 2014). Non-stationarities can also be included in the signal variance or 
lengthscale with Gaussian process priors. Nonstationary lengthscale was introduced by Gibbs (1997) and further 
extended by Paciorek and Schervish (2004) with MCMC inference. Recently, Tolvanen et al. (2014) introduced a 
non-stationary signal variance using expectation propagation and approximate variational inference. 

In this paper we introduce the first fully non-stationary and heteroscedastic GP regression framework, in which 
all three main components (noise variance, signal variance and the lengthscale) can be simultaneously input- 
dependent, with GP priors'. We propose an inference method for the exact joint posterior of the underlying signal 
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and all three latent functions, avoiding the need for introducing variational or expectation propagation approxi¬ 
mations (Lazaro-Gredilla and Titsias, 2011; Tolvanen et ah, 2014). We use HMC-NUTS, which can effectively 
sample the posterior guided by the model gradients, which we derive analytically. Furthermore, an exact MAP 
solution arises as a simple gradient ascent of the posterior. We enhance both approaches by posterior whitening 
using Cholesky decompositions of the latent function priors. Our experiments demonstrate the necessity of non¬ 
stationary GPR to model realistic input-dependent dynamics, while the proposed method performs comparably to 
conventional stationary or previous non-stationary GPR models otherwise. 

In Section 2 we introduce the fully nonstationary GP model. In its subsections we first introduce MAP and 
HMC inference, discuss model whitening and finally define the predictive distributions. Section 3 presents exper¬ 
imental results on several synthetic and one real biological datasets, and we conclude in Section 4. 

2 Heteroscedatic nonstationary GP model 

Let y = {yi)2=i € R” be an observation vector over n inputs x = {xi)2=i e K”. We assume an additive 
regression model 


y{x) = f{x) + e{x), e{x) ^ J\f{0,uj{xf), 

where both the underlying signal f{x) and the zero-mean observation noise variance uj{x)‘^ are unknown functions 
to be learned. We proceed by first placing a zero mean GP prior on the unknown function f{x), 

fix)^GPiO,Kf{x,x')), (1) 

which assumes that the output covariances cov(/(x), f{x')) = Kf{x, x') depend on the input covariance through 
a kernel function. We use a nonstationary generalisation of the squared exponential kernel (Gibbs, 1997) 


Kf{x,x') 


a{x)a{x) 



exp 


f \ 


( 2 ) 


where x, x' G M, and o'(x) and £(x) are input-dependent signal variance and lengthscale functions, respectively. 
The kernel reduces into a standard squared exponential kernel if both are constant. We show the kernel (2) is 
positive definite in the Supplementary Material. 

We model the lengthscale, signal variance and noise variance with latent functions. We are interested in 
smoothly varying latent functions and thus we place separate GP priors on them as well: 

log(f(f)) = i{t) ~ GP{ni,Ke{x,x')) 
log(cr(f)) = &{t) ~ GP{y,a,Ka{x,x')) 

\og{uj{t)) = oj{t) ~ GP{y,^,Ku:{x, x')), 


where we set the priors on the logarithms to ensure their positivity. We select separate standard squared exponential 
covariances for each, 

Ka(x, X ) = a^ exp ^- ^ — 

where c € {f, a, w}. The model has nine hyper-parameters 9 = [yti, yta, ai, /3r, Pa, Puj) that define 

the prior for the three latent functions f, a and Cj. The means y determine latent function means, while the a’s are 
scaling terms. The /?’s are the characteristic lengthscales of the priors. In practice, the /r’s and a’s have a small 
effect on the models, whereas /3’s determine the smoothness of the latent functions, and can be set based on prior 
knowledge. 

Given a dataset (x, y), the model can be equivalently written as £\t, cr ~ A/'(0, Kf), where f = {f{xi))f^i 
is a latent function vector at observed points x and Kf € has elements [Kf]ij = Kf{xi,Xj) computed 

using eq. (2) with signal variances cr — {<j{xi))2^i and lengthscales £ = (£(xi))lfi. Finally, the data likelihood 
is y|£, cr ^ Af(0, Kf + SI), where S7 = diaga;^ g is a diagonal noise matrix and ll> = {uj{xi))'^^i are the 

noise standard deviations. 

We note that by placing a GP prior on just the noise and restricting the other two to be constants, we arrive 
at the heteroscedastic model studied in several earlier works (Goldberg, 1998; Kersting et ah, 2007; Quadrianto 
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et al., 2009). Setting a prior on only the lengthscale retrieves the models of Gibbs (1997); Paciorek and Schervish 
(2004), and setting a prior on both the signal variance and the noise gives the model of Tolvanen et al. (2014). Out 
method is the first to combine heteroscedatic noise and nonstationary lengthscale, while also allowing the signal 
variance to vary over the inputs. 

To infer latent functions from the full posterior p{i, £, a, d>|y, 9) we introduce two approaches in the next two 
Sections^. We propose to learn the MAP estimate p(f |.^map 7 ^map 7 ‘^map 7 y )7 or infer the full posterior using HMC 
sampling. Both approaches are based on the analytical gradients of the latent functions. 

2.1 Maximum a posteriori estimation 

As the first approach, we follow the approaches by Kersting et al. (2007) and Quadrianto et al. (2009), and resort 
to finding the MAP solution of the latent posterior p{£, cf, tujy), 

^map 7 0 'map 7 ‘^map = argmaxp(l,o-,d>|y), 

where f has been marginalised out. Using Bayes’ theorem this is equivalent to maximizing the marginal likelihood 

= p(y|^,ff,d>)p(l,o-,d>) = Af{y\0,Kf + n)Af{£\p(,Ke)Af{d-\p„,K„)Af{u:\p^,K^), (3) 

whose logarithm we denote as the marginal log likelihood (MLL). 

The partial derivatives of the log of marginal likelihood (3) with respect to the latent functions are analytical: 

= 0.5 tr - [Kj\£ - Mf)]* 

= dia,g - Kf^)Kf) - - p,i^) (4) 

= diag {{aa^ - Kf^)n) - - pq) 

where ct = {Kj + fl)“^y and is given in the Supplementary Material. 

We perform gradient ascent over the MLL, log£. The solution is only guaranteed to converge to a local 
optimum, and hence we perform multiple restarts from random initial conditions. The MAP solution is adequate 
when the posterior is close to unimodal. 

Given the MAP solution, the function posterior p(f|.^MAP 7 < 5 'map 7 d^MAp) ~ -^(Mmapi ^map) is ^ Gaussian with 

/^MAP = KJ {Kf + Umap)” V 
Emap = Kf-Kj (Kf + Umap)-^7^/7 

where Kf has been computed with eq. (2) using MAP latent vectors log(£) = £map and log((T) = o-map 7 and Umap 
with log(u;) = Wmap- 

2.2 HMC inference 

As a second approach we sample the full posterior using Hamiltonian Monte Carlo (HMC) (Hoffman and Gelman, 
2014; Neal, 2011). In HMC, we introduce an additional momentum variable for each of the model variables and 
interpret the extended model as a Hamiltonian system. We simulate time evolution of the Hamiltonian dynamics 
to produce proposals for the Metropolis algorithm. This simulation step makes use of the gradient of the log joint 
density 

p(f, £, a, d>; y) = p(y |f, d>)p(f |£, o-)p(l| 6 »)p(dr| 6 »)p(d>| 6 »). 

However, in a GP regression the function posterior p(f |y) with latent functions £, a, d> marginalised out, can be 
integrated conveniently by 

P(f |y) = JJJ y)p(^’ d;|y)d£do-dw (5) 

^ m 

m. 


^In the following we omit the hyperparameters 0 for notational clarity 


a log/: 

dh 

a log/: 

da 

a log/: 

duj 


3 



where 


r-. p(i^a,uj\y) (6) 

are m HMC samples of the latent posterior. The function posteriorp(f |€i, y) = Si) for each HMC 

sample is a Gaussian with 

^,=Kp-Kl{Kp+ni)-^Kf^, 

where Kf. is a nonstationary kernel matrix computed using li and a-i, and fli is the diagonal noise covariance 
matrix of CJi. 

Hence, we only need to do HMC sampling over the three latent vectors (£, cr,tD) and the posterior of f 
follows analytically as a mixture of m Gaussians. The latent posterior p(£, (t, tDjy) is proportional to the marginal 
likelihood in eq. (3), and thus the HMC sampling of (£, cr, tD) uses the same gradients ^ ^^ ^ ^ 

from eq. (4) as the MAP solution. 


2.3 Posterior whitening 

The posterior of the latent vectors is by definition highly correlated due to Gaussian priors, leading to inefficient 
Monte Carlo sampling. To ease the sampling, we perform the sampling over the whitened latent vectors (Kuss and 
Rasmussen, 2005) 

I = LJ^l, Ki = LiLj 

- ^(7 ^ -^^<7 - ^( 7^(7 

= L^LI, 

with Cholesky decompositions of the corresponding GP prior covariances, which are fixed based on the hyperpa¬ 
rameters 6. The derivative for e.g. the lengthscale becomes f' = LTV^C, where the last term 

at oLet at t -c 

is the standard gradient of the non-whitened model defined in eq. (4). 


2.4 Making predictions 

Both the MAP solution and the HMC sampler infer values of the latent functions only at the n observed inputs x. 
To extrapolate the values of the unknown function and the latent functions over arbitrary target points x* € M"*, 
we define the predictive distribution, where we extrapolate the latent functions cr,d> to £*,cr*,d;*, and then 
express the function posterior f* with them (Goldberg, 1998). With the MAP solution we have 




MAP; cr MAP; ^MAP; 


y) = / / p(f*|£MAP,-^* 


*; cr MAP; CT *, 




p,y)p{i*\i MAP )p( 


* (TIV 


p)d£^da^. 


1 

S 


s 

'y ' P(f*|^MAP; £jt ; ^ MAP; ; ^^MAP; Y) 


(7) 


where we approximate the integral by drawing s samples of n* dimensions from the condi¬ 

tional Gaussians £* |£map and d"* |<Tmap (See Supplementary Material). This results in a mixture of s corresponding 
Gaussians A/’(/Xmap j*; ^MAP.i*), where 

Pmapj* 

^MAPJ* 

and where G ATmapj* G and ATmap.map G are computed with eq. (2) over the 

latent vectors (£map;^map) over inputs x, or using {£j„,d^j„) over inputs x*. The simplest approximation is to 
choose only a single sample s = 1 from the conditionals by choosing the conditional means £j^ = E[£*|£map] 
and = E[cf*|crMAp] (See Supplementary Material). This is a sufficient approximation if the inputs x are 
sufficiently dense. 


= K, 


MAP,J, 


(K, 


MAP,MAP -f ^MAp) 


= K 




- 


(A, 


MAP,MAP 


T f^MAp) A; 


MAP,j* ; 
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The predictive distribution given the HMC sample {£i, is derived analogously. We average over the 

m HMC samples instead of a single MAP solution, and over the s samples {£ij^}j^i and from the 

conditionals, resulting in 




^ ^ ^ ^ Pi^ir \ii7 ^ij* ; 5 i Y ) 


ms 


i=i j=i 


.me 


ms 


i=i j=i 


( 8 ) 


where = Kfj^ {Ki + and and where the kernel matrices 

are computed using £.i, cf j and £ij^, . 

We note that a slower but perhaps more elegant alternative is to model latent functions jointly over concatenated 
inputs X( = (x, X*), resulting in £t = {£, £*), and analogously for the other functions. In this case the function 
posterior contains the predictive posterior, but the latent vector sizes increase to n + n*. 


3 Experiments 

We assess the performance of the proposed method on several synthetic and real datasets. We experiment with 8 
synthetic datasets and a gene expression time series dataset (Heinonen et al., 2015). Of the synthetic datasets, three 
datasets are from the literature; the motorcycle dataset M (Silverman, 1985), the ‘jump’ dataset J (Paciorek and 
Schervish, 2004) and a nonstationary dataset T from GPstuff (demo_epinf in Vanhatalo et al. (2013)). We also 
generated five synthetic datasets with different types of nonstationarities (See Table 1). We expect datasets exhibit¬ 
ing specific types of input-dependent characteristics to require a model with the corresponding input-dependencies. 

Table 1: Datasets with varying forms of non-stationarities and sizes. 


Dataset 

Non-stationary functions 

n 

‘Strain 

Comment 


uj{i) 

133 

67 

Motorcycle dataset (Silverman, 1985) 

D. 

a{t) 

100 

50 



£{t) 

150 

75 




100 

50 



£{t),uj{t) 

150 

75 




90 

45 


J 

N/A 

101 

50 

Jump dataset (3rd) (Paciorek and Schervish, 2004) 



500 

250 

from GPstuff demo_epinf (Vanhatalo et al., 2013) 


All dataset outputs are normalised to range [—1,1] and inputs to range [0, Ij. For each dataset, we use 
half of the data as training data and the rest as test data. We will assess the performance against the test data 
with mean squared error MSB = ~ '^he negative log probability density NLPD = 

values. For consistency, we model stationary parameters as vectors cl of length 

Strain for C — j , fJ, f 

We run MAP optimisation from 10 different initial conditions and choose the one with the highest MLL value. 
We run 10 chains of 1000 samples of HMC-NUTS sampling using model whitening (Algorithm 3 of Hoffman 
and Gelman (2014), e = 0.01, maximum tree depth 10). The hyperparameters are set to a = 1, and m = 0.2, 
Per = 0.5 and = 0.1. We set the /3 parameters to sensible values of I3i = /3a = 0.1 and j3i^ = 0.2. 

3.1 Regression performance 

Table 2 shows the MSB and NBPD performance on the test folds of the synthetic datasets using various MAP 
GP models. For each dataset, the model with the smallest NBPD, or second smallest NBPD value, is the one 
where the model’s nonstationarities match those of the dataset. For instance, the dataset T contains heteroscedatic 
noise and input-dependent cr. For this, the best performance is obtained with a matching w, ct-GP and with a fully 
nonstationary w, cr, f-GP as well. 
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Table 2: Test MSB and NLPD results on the synthetic datasets over various MAP models. Optimal values are in 
boldface. The optimal or second to optimal NLPD values follow a diagonal line. Smaller values are better for 
both quantities. 


Method 


D 


Df 

D, 

J.a 

D, 



a.£ 

j 

T, 


MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

MSE 

NLPD 

GP 

3.91 

0.11 

0.21 

-1.31 

0.71 

-0.85 

0.44 

-0.86 

0.54 

-1.04 

0.16 

-1.25 

1.48 

-0.32 

17.83 

0.10 

(w)-GP 

3.91 

-0.22 

0.21 

-1.27 

2.46 

-0.47 

0.45 

-0.98 

2.69 

-1.28 

0.33 

-1.73 

3.51 

-0.58 

17.35 

0.01 

(ct)-GP 

3.93 

0.17 

0.18 

-1.37 

0.64 

-0.93 

0.43 

-0.94 

0.52 

-1.07 

0.17 

-0.42 

1.38 

1.04 

16.84 

0.07 

(e)-GP 

3.94 

0.16 

0.18 

-1.38 

0.53 

-1.05 

0.44 

-0.88 

0.41 

-1.16 

0.17 

-0.34 

1.35 

-0.72 

17.63 

0.09 

(w,ct)-GP 

3.87 

-0.23 

0.18 

-1.30 

0.65 

-0.86 

0.43 

-1.07 

0.56 

-1.51 

0.15 

-1.61 

1.60 

-0.66 

16.33 

-0.02 

(ujJj-GP 

4.02 

-0.19 

0.19 

-1.31 

0.53 

-0.93 

0.42 

-0.99 

0.40 

-1.83 

0.17 

-0.90 

1.38 

-0.47 

9.30 

0.01 

(uj, (7, .^)-GP 

3.90 

-0.21 

0.19 

-1.32 

0.53 

-0.90 

0.45 

-0.98 

0.43 

-1.70 

0.16 

-1.79 

1.47 

-0.32 

9.77 

-0.00 


0.5 


MAP vs HMC 
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• HMC samples 
A HMC mean 

• MAP solutions 


-1.1 


• HMC solutions 




# MAP solution 


1 2 3 

MSB 

(a) All datasets and models 


-1.15 


0.45 0.5 0.55 0.6 

MSE 

(b) dataset with £-GP model 


0.65 


0.7 


Figure 1: The MSE and NLPD performance of the HMC posterior samples, (a) Comparison of test errors between 
MAP and HMC mean solutions over all datasets and methods (8x7 = 56, x-axis limited to 5 for clarity), (b) 
Test errors of the HMC samples compared to the HMC mean and MAP solution on a single Di dataset with f-GP 
model. 


With MSE the stationary GP performs slightly better, albeit still worse than the optimal non-stationary method. 
This applies for every dataset. 

Adding ‘unnecessary’ nonstationarities retains or only slightly worsens the performance, with the major ex¬ 
ception being the ‘jump’ dataset J. Here, the lengthscale is clearly input-dependent (NLPD —0.72, optimal), while 
in contrast the nonstationary signal variance a is unable to model the data (NLPD 1.04). Adding Heteroscedatic 
noise to any of the models of this dataset weakens the model. 

3.2 HMC performance 

We explored the difference between the MAP solution and the HMC sampling. In practice we found the MAP 
to be slightly better on average regarding the MSE and NLPD values (See Figure la). However, the sampling 
solution is robust against multimodality in the latent posterior. Figure lb shows the test errors of the individual 
HMC samples in comparison to the MAP solution with the dataset using the AGP model. The HMC solution 
includes numerous samples that are better, while on average being slightly worse than MAP. 

The dataset contains several latent modes (See Figure 2, bottom), which the HMC sampler captures. These 
modes include latent functions that imply a ‘shortcut’ or a ‘zigzag’ signal around timepoints 0.18 or 0.75, or 
both. The HMC samples are centered mostly around the shortcut profile at the earlier timepoint, while only a few 
samples with the shortcut profile exist at the later timepoint. The MAP solution has chosen both zigzags. The 
latent posterior shows largest variance in the signal variance a(t) component, while the lengthscale £(t) and noise 
variance w have tighter distributions. 
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Nonstationary function 95% posterior 




time 


Figure 2; The set of function posteriors corresponding to the latent function samples are plotted in gray, and the 
MAP solution in green, along with the training and test data (top). The latent function sample is drawn with green 
(a), red {£) and blue (w) colors, with the MAP solution highlighted with bold lines (bottom). 


3.3 Biological dataset 

We showcase the method with a biological dataset of 205 gene expression time series measurements of human 
endothelial cells after irradiation at time t = 0. Due to the irradiation the dataset exhibits nonstationary dynamics 
as the cells try to repair themselves and revert back to steady states. The gene expressions are measured over 
8 days (0.5,1, 2, 3,4, 7,14, 21) in three replicates (Heinonen et al., 2015). The goal is to construct a realistic 
model of the underlying gene expression process and the underlying dynamics with no knowledge of the ‘true’ 
expression levels, given only the small number of sparse measurements. 

We modeled the dataset using stationary GP, heteroscedatic w-GP and three nonstationary GPs: {oj, cr)-GP, 
(w, ^)-GP and with (w, a, £)-GP. We found the performance of the three nonstationary GPs to be similar. Figure 
3 indicates the MLL, MSE and NLPD values of the 205 timeseries under stationary, heteroscedatic or fully non¬ 
stationary models. Addition of heteroscedasticity greatly increases the model fits, while also improving the data 
likelihoods against the function posterior. Finally the fully nonstationary GP still improves model fits, while con¬ 
sistently improving the NLPD values, with similar MSE performance compared to the HGP. Eigure 4 compares 
the three models learned from an an example gene expression time series. 

3.4 Latent function reconstruction 

Einally, we highlight the methods potential in reconstructing the latent parameter processes. We simulate a noisy 
sample where true generating latent parameter functions £{■), a{-),uj{-) are known. We computed both the MAP 
solution and sampled the posterior of the latent space and of the unknown function /(•) using FIMC inference 
and showcase the results in Figure 5. The latent function reconstruction error curve against increasing numbers of 
noisy points shows that the regression error, and lengthscale and noise variances converge to true values (Figure 5 
top), while the signal variance shows small bias but is still well estimated. 

Figure 5 bottom highlights the MAP and HMC solutions given 150 datapoints, and compares them to the 
state-of-the-art a, w-GP model of Tolvanen et al. (2014). We are able to accurately estimate the latent functions. 
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Figure 3: Comparison of the MLL, MSE and NLPD values (x-axis) of the stationary GP, w-GP and (w, a, ^)-GP 
over the 205 gene expression time series (x-axis) against the heteroscedastic GP on the y-axis. Each row contains 
a triplet of values corresponding to the three GP models of the same time series. 


(o.s.n-GP function 95% Dosterior 



0 0.2 0.4 0.6 0.8 1 

time 




time 




Eigure 4; Comparison of three GP models on an example expression time series. 


4 Discussion 

In this paper, we have proposed a fully non-stationary Gaussian process regression framework, where all three key 
components can be input-dependent. Our approach uses analytical gradient-based techniques to perform inference 
with HMC sampling and MAP estimation. We are able to effectively sample from the exact posterior of the latent 
functions. We have shown that the method is able to infer the underlying latent functions and improve regression 
performance when the datasets truly are nonstationary, and achieve equivalent performance to a stationary model 
when they are not. 

The interplay between the signal variance and the lengthscale is an interesting topic (Diggle et al., 1998; Zhang, 
2004). When modeling the ‘jump’ dataset the signal variance was unable to model dynamics, while nonstationary 
lengthscale produced a good model. This is natural since the signal variance serves as a linear amplitude, while 
the lengthscale has a possibly non-linear effect on the function model. 

The gradient-based HMC is a powerful inference tool for Gaussian processes, and could be further enhanced 
by utilizing natural gradients or position-dependent mass matrices with Riemannian Manifold HMC (Girolami 
and Calderhead, 2011). We note that the method could be extended by also inferring the hyperparameters 8 using 
HMC. However, proper care has to be taken to set their priors. 
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Latent function reconstruction error 





Noise variance posterior 

0.1 I-_ 


Samples 

^ --^ 

— — MAP lengthscale 


-Generating lengthscale 


— — so-GP noise variance 

✓ ^ ^ 

^-- 
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0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



(b) 150 data points 


Figure 5: Latent function reconstruction errors (top) over a nonstationary simulated dataset with known nonsta¬ 
tionary dynamics. Fully nonstationary GP model (bottom) posteriors, MAP solution (dashed line) and generating 
latent functions (black solid lines) with 150 data points. The latent lengthscales and noises are estimated correctly, 
while signal variance is approximately matched. Dashed black lines shows the comparison to the state-of-the-art 
cr, w-GP model of Tolvanen et al. (2014). 


9 






































Supplemental information 

Comparison between vanilla and (a;, a, £)-GP 

A comparison between stationary (vanilla) and (w, u, £)-GP is shown in Supplemental Figure 6. 




time time 


(a) Vanilla GP (NLPD -1.95) (b) Fully nonstationary GP (NLPD -2.47) 


Figure 6: Nonstationary, heteroscedastic GP (right) fits nonstationary dataset better than vanilla GP (left). Vanilla 
GP overestimates confidence intervals at stationary regions, as a result of having to choose the lengthscale to ht 
the nonstationary regions, and for the same reason produces spurious oscillations in regions where there are no 
observations. The dataset is the Duj,(t,£ from Table 2. 


Kernel SDP proof 


The kernel 


K(x, x') = 


I 2i{x)i{x') 

£{xy + £{x'y 


exp - 


{x — x')"^ 

+ £{x'y 


is a one-dimensional Gaussian SDP kernel x^ ) of Paciorek and Schervish (2004). The kernel Kj can be 

stated as 

Kf{x,x') = u{x)K{x,x')a{x'), 

which is positive dehnite for any function (t( ) (Shawe-Taylor and Christianini, 2004). 


Conditional distributions 

The conditional distributions of the latent functions at target timepoints x* given the latent functions at observed 
timepoints x are 

p(C|^) = A/'(C|Ar£(x*, x.)'^Ki{x, x)“^(£ - Hi) + Hi, x*) - Ki{x, x)) 

- Ha) + “ K„{x^,x)'^K„{x,x)~'^K„{x^,,x)) 

p(a>*|a>) = A/'(d)*|A:^(x*,x)^iT<^(x,x)“^(a> - Hu,) + Hu,, Ku,{x*,x*) - iTa;(x*, x)^A:^(x, x)“^A:^(x*, x)), 

where Ki, and are standard gaussian kernels computed using the hyperparameters 6 (Rasmussen and 
Williams, 2006). 
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Partial derivatives 

The partial derivatives of the unconstrained latent functions against the marginal log likelihood 

log£ = logp(y|£,d-,d>,6») = logp(y|l,d-,d>)p(l|6»)p(d-|6»)p(d;|6») 

= log A/'(y|0, itT/ + n)+ log K^) + log A/'(cr|^^, + log K^j) 


are analytically derived. 

The partial derivative for £{t) is 


d\ogC 

dh 

dh 


i tr (^(aa^ - K, - [I<i Hi - />,')!■ 


Sij = CTiCTj, Rij = ^Eij = exp and + £]■ The derivative matrix becomes a 

’plus’ matrix where only I’th column and row are nonzero. 

The derivatives for a{t) is 

= diag ({aa^ - K-^)Kf) - - /r^) 

and for Cj{t) is 

^^1^ = diag {{cxa^ - K-^)n) - 

where a = K~^y. 

The partial derivatives of the latent parameters {£, a, oj) € in a stationary formulation are for £ (Rasmussen 
and Williams, 2006) 


for a 


and for Co 


ar 

di ' 

dKy _ 
d£ 

dL 


(K 

duo 


i tr - Ky “ Mf) 

£-‘^DQKf, 

tr ((aa"^ - K-^)Kf) - K^^al - 
tr ((aa^ - K;^)n) - 1^K^\C;1 - fia.). 
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